Abstract

We study data structures in the presence of adversarial noise. We want to encode a given 
object in a succinct data structure that enables us to efficiently answer specific queries about 
the object, even if the data structure has been corrupted by a constant fraction of errors. This 
new model is the common generalization of (static) data structures and locally decodable error- 
correcting codes. The main issue is the tradeoff between the space used by the data structure 
and the time (number of probes) needed to answer a query about the encoded object. We 
prove a number of upper and lower bounds on various natural error-correcting data structure 
problems. In particular, we show that the optimal length of error-correcting data structures for 
the Membership problem (where we want to store subsets of size s from a universe of size n) 
is closely related to the optimal length of locally decodable codes for s-bit strings. 

Keywords: data structures, fault-tolerance, error-correcting codes, locally decodable codes, 
membership problem, length-queries tradeoff 

1 Introduction 

Data structures deal with one of the most fundamental questions of computer science: how can we 
store certain objects in a way that is both space-efficient and that enables us to efficiently answer 
questions about the object? Thus, for instance, it makes sense to store a set as an ordered list 
or as a heap-structure, because this is space-efficient and allows us to determine quickly (in time 
logarithmic in the size of the set) whether a certain element is in the set or not. 

From a complexity-theoretic point of view, the aim is usually to study the tradeoff between 
the two main resources of the data structure: the length/size of the data structure (storage space) 
and the efficiency with which we can answer specific queries about the stored object. To make this 
precise, we measure the length of the data structure in bits, and measure the efficiency of query- 
answering in the number of probes, i.e., the number of bit-positions in the data structure that we 
look at in order to answer a query. The following is adapted from Miltersen's survey [Mil99]:

Definition 1 Let D be a set of data items, Q be a set of queries, A be a set of answers, and
$f : D \times Q \to A$. A $(p, \varepsilon)$-data structure for f of length N is a map $\phi : D \to \{0,1\}^N$ for which
there is a randomized algorithm $\mathcal{A}$ that makes at most p probes to its oracle and satisfies, for every
$q \in Q$ and $x \in D$,

$$\Pr[\mathcal{A}^{\phi(x)}(q) = f(x,q)] \geq 1 - \varepsilon.$$



* rdewolf@cwi.nl. Partially supported by Veni and Vidi grants from the Netherlands Organization for Scientific
Research (NWO), and by the European Commission under the Integrated Project Qubit Applications (QAP) funded
by the IST directorate as Contract Number 015848.
Usually we will study the case $D \subseteq \{0,1\}^n$ and $A = \{0,1\}$. Most standard data structures
taught in undergraduate computer science are deterministic, and hence have error probability
$\varepsilon = 0$. As mentioned, the main complexity issue here is the tradeoff between N and p. Some data
structure problems that we will consider are the following:

• Equality. $D = Q = \{0,1\}^n$, and $f(x,y) = 1$ if $x = y$, $f(x,y) = 0$ if $x \neq y$. This is not a
terribly interesting data structure problem in itself, since for every x there is only one query 
y for which the answer is '1'; we merely mention this data structure here because it will be 
used to illustrate some definitions later on. 

• Membership. $D = \{x \in \{0,1\}^n : \text{Hamming weight } |x| \leq s\}$, $Q = [n] := \{1,\ldots,n\}$, and
$f(x,i) = x_i$. In other words, x corresponds to a set of size at most s from a universe of
size n, and we want to store the set in a way that easily allows us to make membership
queries. This is probably the most basic and widely-studied data structure problem of them
all [FKS84, Yao81, BMRV02, RSV02]. Note that for s = 1 this is Equality on log n bits,
while for s = n it is the general Membership problem without constraints on the set.

• Substring. $D = \{0,1\}^n$, $Q = \{y \in \{0,1\}^n : |y| \leq r\}$, $f(x,y) = x_y$, where $x_y$ is the |y|-bit
substring of x indexed by the 1-bits of y (e.g., $1010_{0110} = 01$). For r = 1 it is Membership.

• Inner product ($\mathrm{IP}_{n,r}$). $D = \{0,1\}^n$, $Q = \{y \in \{0,1\}^n : |y| \leq r\}$ and $f(x,y) = x \cdot y \bmod 2$.
This problem is among the hardest Boolean problems where the answer depends on at most r
bits of x (again, for r = 1 it is Membership).

More complicated data structure problems such as Rank, Predecessor, Nearest neighbor 
have also been studied a lot, but we will not consider them here. 

One issue that the above definition ignores is noise. Memory and storage devices
are not perfect: the world is full of cosmic rays, small earthquakes, random (quantum) events,
passing trams, etc., that can cause a few errors here and there. Another potential source of noise
is transmission of the data structure over some noisy channel. Of course, better hardware can partly
mitigate these effects, but in many situations it is realistic to expect a small fraction of the bits in
the storage space to become corrupted over time. Our goal in this paper is to study error-correcting
data structures. These still enable efficient computation of $f(x,q)$ from the stored data structure
$\phi(x)$, even if the latter has been corrupted by a constant fraction of errors. In analogy with the
usual setting for error-correcting codes [MS77, vL98], we will take a pessimistic, adversarial view
of errors here: we want to be able to deal with a constant fraction of errors no matter where they
are placed. Formally, we define error-correcting data structures as follows.

Definition 2 Let D be a set of data items, Q be a set of queries, A be a set of answers, and
$f : D \times Q \to A$. A $(p, \delta, \varepsilon)$-error-correcting data structure for f of length N is a map $\phi : D \to \{0,1\}^N$
for which there is a randomized algorithm $\mathcal{A}$ that makes at most p probes to its oracle and satisfies

$$\Pr[\mathcal{A}^{y}(q) = f(x,q)] \geq 1 - \varepsilon,$$

for every $q \in Q$, every $x \in D$, and every $y \in \{0,1\}^N$ at Hamming distance $\Delta(y, \phi(x)) \leq \delta N$.
Definition 1 is the special case of Definition 2 where $\delta = 0$.$^1$ Note that if $\delta > 0$ then the adversary
can always set the errors in a way that gives the decoder $\mathcal{A}$ a non-zero error probability. Hence the
setting with bounded error probability is the natural one for error-correcting data structures. This
contrasts with the standard noiseless setting, where one usually considers deterministic structures.

A simple example of an efficient error-correcting data structure is for Equality: encode x with
a good error-correcting code $\phi(x)$. Then $N = O(n)$, and we can decode by one probe: given y,
probe $\phi(x)_j$ for uniformly chosen $j \in [N]$, compare it with $\phi(y)_j$, and output 1 iff these two bits are
equal. If up to a $\delta$-fraction of the bits in $\phi(x)$ are corrupted, then we will give the correct answer
with probability $1 - \delta$ in the case $x = y$. If the distance between any two codewords is close to N/2
(which is true for instance for a random linear code), then we will give the correct answer with
probability about $1/2 - \delta$ in the case $x \neq y$. These two probabilities can be balanced to 2-sided
error $\varepsilon = 1/3 + 2\delta/3$. The error can be reduced further by allowing more than one probe.
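The balancing step is not spelled out in the text. One way to obtain the stated error, offered here as an assumption rather than as the author's rule, is to answer 0 whenever the two probed bits differ and to answer 1 only with probability q = 2/3 when they agree. The sketch below (the function name and the parameter q are ours) computes the worst-case error of that rule exactly, assuming the distance between distinct codewords is exactly N/2.

```python
from fractions import Fraction

def balanced_error(delta, q=Fraction(2, 3)):
    """Worst-case error of the biased one-probe Equality decoder.

    Assumed rule (not stated in the text): if the probed bits differ,
    answer 0; if they agree, answer 1 with probability q, else 0.
    """
    # Case x = y: the probed bits agree with probability >= 1 - delta,
    # and we are correct only when we then answer 1.
    ok_equal = (1 - delta) * q
    # Case x != y: the bits differ with probability >= 1/2 - delta
    # (correct answer 0); if they agree we still answer 0 w.p. 1 - q.
    p_differ = Fraction(1, 2) - delta
    ok_unequal = p_differ + (1 - p_differ) * (1 - q)
    return 1 - min(ok_equal, ok_unequal)

delta = Fraction(1, 10)
print(balanced_error(delta))            # error of the biased rule
print(Fraction(1, 3) + 2 * delta / 3)   # closed form from the text
```

With q = 2/3 both cases succeed with probability $2(1-\delta)/3$, which is where the two-sided error $1/3 + 2\delta/3$ comes from.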

We only deal with so-called static data structures here: we do not worry about updating the x
that we are encoding. What about dynamic data structures, which allow efficient updates as well
as efficient queries to the encoded object? Note that if data items x and x' are distinguishable in
the sense that $f(x,q) \neq f(x',q)$ for at least one query $q \in Q$, then their respective error-correcting
encodings $\phi(x)$ and $\phi(x')$ will have distance $\Omega(N)$.$^2$ Hence updating the encoded data from x to
x' will require $\Omega(N)$ changes in the data structure, which shows that a dynamic version of our
model of error-correcting data structures with efficient updates is not possible.

Error-correcting data structures not only generalize the standard (static) data structures
(Definition 1), but they also generalize locally decodable codes. These are defined as follows:

Definition 3 A $(p, \delta, \varepsilon)$-locally decodable code (LDC) of length N is a map $\phi : \{0,1\}^n \to \{0,1\}^N$
for which there is a randomized algorithm $\mathcal{A}$ that makes at most p probes to its oracle and satisfies

$$\Pr[\mathcal{A}^{y}(i) = x_i] \geq 1 - \varepsilon,$$

for every $i \in [n]$, every $x \in \{0,1\}^n$, and every $y \in \{0,1\}^N$ at Hamming distance $\Delta(y, \phi(x)) \leq \delta N$.

Note that a $(p, \delta, \varepsilon)$-error-correcting data structure for Membership (with s = n) is exactly
a $(p, \delta, \varepsilon)$-locally decodable code. Much work has been done on LDCs, but their length-vs-probes
tradeoff is still largely unknown for $p \geq 3$. We refer to [Tre04] and the references therein.

LDCs address only a very simple type of data structure problem: we have an n-bit "database"
and want to be able to retrieve individual bits from it. In practice, databases have more structure
and complexity, and one usually asks more complicated queries, such as retrieving all records within
a certain range. Our more general notion of error-correcting data structures enables a study of such
more practical data structure problems in the presence of adversarial noise.

Comment on terminology. The terminologies used in the data-structure and LDC-literature
conflict at various points, and we needed to reconcile them somehow. To avoid confusion, let us
repeat here the choices we made. We reserve the term "query" for the question q one asks about the
encoded data x, while accesses to bits of the data structure are called "probes" (in contrast, these
are usually called "queries" in the LDC-literature). The number of probes is denoted by p. We use
n for the number of bits of the data item x (in contrast with the literature about Membership,
which mostly uses m for the size of the universe and n for the size of the set). We use N for the
length of the data structure (while the LDC-literature mostly uses m, except for Yekhanin [Yek07]
who uses N as we do). We use the term "decoder" for the algorithm $\mathcal{A}$. Another issue is that $\varepsilon$
is sometimes used as the error probability (in which case one wants $\varepsilon \approx 0$), and sometimes as the
bias away from 1/2 (in which case one wants $\varepsilon \approx 1/2$). We use the former.

1 As [BMRV02, end of Section 1.1] notes, a data structure can be viewed as a locally decodable source code. With
this information-theoretic point of view, an error-correcting data structure is a locally decodable combined
source-channel code, and our results for Membership show that one can sometimes do better than combining the best
source code with the best channel code. We thank one of the anonymous referees for pointing this out.

2 Hence if all pairs $x, x' \in D$ are distinguishable (which is usually the case), then $\phi$ is an error-correcting code.

1.1 Our results 

If one subscribes to the approach towards errors taken in the area of error-correcting codes, then 
our definition of error-correcting data structures seems a very natural one. Yet, to our knowledge, 
this definition is new and has not been studied before (see Section 1.2 for other approaches).

1.1.1 Membership 

The most basic data structure problem is probably the Membership problem. Fortunately, our 
main positive result for error-correcting data structures applies to this problem. 

Fix some number of probes p, noise level $\delta$, and allowed error probability $\varepsilon$, and consider the
minimal length of p-probe error-correcting data structures for s-out-of-n Membership. Let us call
this minimal length MEM(p, s, n). A first observation is that such a data structure is actually a
locally decodable code for s bits: just restrict attention to n-bit strings whose last n − s bits are
all 0. Hence, with LDC(p, s) denoting the minimal length among all p-probe LDCs that encode s
bits (for our fixed $\varepsilon, \delta$), we immediately get the obvious lower bound

$$\mathrm{LDC}(p,s) \leq \mathrm{MEM}(p,s,n).$$

This bound is close to optimal if $s \approx n$. Another trivial lower bound comes from the observation
that our data structure for Membership is a map with domain of size $B(n,s) := \sum_{i=0}^{s} \binom{n}{i}$ and
range of size $2^N$ that has to be injective. Hence we get another obvious lower bound

$$\Omega(s \log(n/s)) \leq \log B(n,s) \leq \mathrm{MEM}(p,s,n).$$
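To get a feel for the counting bound, the following sketch computes $B(n,s)$ exactly and the smallest N for which $2^N \geq B(n,s)$, and compares it with $s \log_2(n/s)$. The function names and the sample parameters are ours, chosen only for illustration.

```python
from math import comb, log2

def B(n, s):
    """Number of subsets of [n] of size at most s."""
    return sum(comb(n, i) for i in range(s + 1))

def counting_lower_bound(n, s):
    """Smallest N with 2**N >= B(n, s), i.e. ceil(log2 B(n, s))."""
    return (B(n, s) - 1).bit_length()

n, s = 1024, 16
print(counting_lower_bound(n, s))   # bits forced by injectivity
print(s * log2(n / s))              # the simpler s*log(n/s) estimate
```

Any Membership structure, error-correcting or not, needs at least this many bits, since distinct sets must receive distinct encodings.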

What about upper bounds? Something that one can always do to construct error-correcting data
structures for any problem is to take the optimal non-error-correcting $p_1$-probe construction and
encode it with a $p_2$-probe LDC. If the error probability of the LDC is much smaller than $1/p_1$, then
we can just run the decoder for the non-error-correcting structure, replacing each of its $p_1$ probes
by $p_2$ probes to the LDC. This gives an error-correcting data structure with $p = p_1 p_2$ probes. In the
case of Membership, the optimal non-error-correcting data structure of Buhrman et al. [BMRV02]
uses only 1 probe and $O(s \log n)$ bits. Encoding this with the best possible p-probe LDC gives
error-correcting data structures for Membership of length $\mathrm{LDC}(p, O(s \log n))$. For instance for p = 2
we can use the Hadamard code$^3$ for s bits, giving upper bound $\mathrm{MEM}(2,s,n) \leq \exp(O(s \log n))$.

3 The Hadamard code of $x \in \{0,1\}^s$ is the codeword of length $2^s$ obtained by concatenating the bits $x \cdot y \pmod 2$
for all $y \in \{0,1\}^s$. It can be decoded by two probes, since for every $y \in \{0,1\}^s$ we have $(x \cdot y) \oplus (x \cdot (y \oplus e_i)) = x_i$.
Picking y at random, decoding from a $\delta$-corrupted codeword will be correct with probability at least $1 - 2\delta$, because
both probes $y$ and $y \oplus e_i$ are individually random and hence probe a corrupted entry with probability at most $\delta$.
This exponential length is optimal for 2-probe LDCs [KW04].
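The two-probe decoding of the Hadamard code can be sketched as follows. The function names are ours, and strings y are represented as integer bit masks purely for convenience; `success_fraction` enumerates all choices of y, so it returns the exact success probability of the randomized decoder on a given (possibly corrupted) word.

```python
import random

def hadamard_encode(x):
    """Codeword of length 2**s: entry y holds x.y mod 2 (y read as a bit mask)."""
    s = len(x)
    return [sum(x[k] for k in range(s) if (y >> k) & 1) % 2
            for y in range(1 << s)]

def two_probe_decode(word, i, y):
    """Decode x_i from the two probes at positions y and y XOR e_i."""
    return word[y] ^ word[y ^ (1 << i)]

def decode_bit(word, i, s, rng=random):
    """Randomized decoder: pick y uniformly, then use two probes."""
    return two_probe_decode(word, i, rng.randrange(1 << s))

def success_fraction(word, i, s, x_i):
    """Exact success probability of decode_bit, by enumerating every y."""
    good = sum(two_probe_decode(word, i, y) == x_i for y in range(1 << s))
    return good / (1 << s)

x = [1, 0, 1, 1]
word = hadamard_encode(x)
word[5] ^= 1                                  # corrupt 1 of 16 entries
print(success_fraction(word, 2, 4, x[2]))     # 0.875, i.e. 1 - 2*(1/16)
```

In this toy run a single corrupted entry spoils exactly the two choices of y whose probes touch it, which matches the $1 - 2\delta$ bound with $\delta = 1/16$.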
Our main positive result in Section 2 says that something much better is possible: the max of
the above two lower bounds is not far from optimal. Slightly simplifying, we prove

$$\mathrm{MEM}(p,s,n) \leq O(\mathrm{LDC}(p, 1000s) \log n).$$

In other words, if we have a decent p-probe LDC for encoding O(s)-bit strings, then we can use
this to encode sets of size s from a much larger universe [n], at the expense of blowing up our
data structure by only a factor of log n.$^4$ For instance, for p = 2 probes we get
$\mathrm{MEM}(2,s,n) \leq \exp(O(s)) \log n$ from the Hadamard code, which is much better than the earlier $\exp(O(s \log n))$.
For p = 3 probes, we get $\mathrm{MEM}(3,s,n) \leq \exp(\exp(\sqrt{\log s})) \log n$ from Efremenko's recent 3-probe
LDC [Efr08] (which improved Yekhanin's breakthrough construction [Yek07]). Our construction
relies heavily on the Membership construction of [BMRV02]. Note that the near-tightness of the
above upper and lower bounds implies that progress (meaning better upper and/or lower bounds)
on locally decodable codes for any number of probes is equivalent to progress on error-correcting
data structures for s-out-of-n Membership.

1.1.2 Inner product 

In Section 3 we analyze the inner product problem, where we are encoding $x \in \{0,1\}^n$ and want to
be able to compute the dot product $x \cdot y \pmod 2$, for any $y \in \{0,1\}^n$ of weight at most r. We first
study the non-error-correcting setting, where we can prove nearly matching upper and lower bounds
(this is not the error-correcting setting, but provides something to compare it with). Clearly, a
trivial 1-probe data structure is to store the answers to all B(n,r) possible queries separately. In
Section 3.1 we use a discrepancy argument from communication complexity to prove a lower bound
of about $B(n,r)^{1/p}$ on the length of p-probe data structures. This shows that the trivial solution
is essentially optimal if p = 1.

We also construct various p-probe error-correcting data structures for inner product. For small
p and large r, their length is not much worse than the best non-error-correcting structures. The
upshot is that inner product is a problem where data structures can sometimes be made
error-correcting at little extra cost compared to the non-error-correcting case; admittedly, this is mostly
because the non-error-correcting solutions for $\mathrm{IP}_{n,r}$ are already very expensive in terms of length.

1.2 Related work 

Much work has of course been done on locally decodable codes, a.k.a. error-correcting data
structures for the Membership problem without constraints on the set size [Tre04]. However, the
error-correcting version of s-out-of-n Membership ("storing sparse tables") or of other possible
data structure problems has not been studied before.$^5$ Here we briefly describe a number of other
approaches to data structures in the presence of memory errors. There is also much work on data
structures with faulty processors, but we will not discuss that here.

Fault-tolerant pointer-based data structures. Aumann and Bender [AB96] study fault-tolerant
versions of pointer-based data structures. They define a pointer-based data structure
as a directed graph where the edges are pointers, and the nodes come in two types: information
nodes carry real data, while auxiliary nodes carry auxiliary or structural data. An error is the
destruction of a node and its outgoing edges. They assume such an error is detected when accessing
the node. Even a few errors may be very harmful to pointer-based data structures: for instance,
losing one pointer halfway through a standard linked list means we lose the second half of the list. They
call a data structure (d, g)-fault-tolerant (where d is an integer that upper bounds the number of
errors, and g is a function) if $f \leq d$ errors cause at most g(f) information nodes to be lost.

4 Our actual result, Theorem 2, is a bit dirtier, with some deterioration in the error and noise parameters.

5 Using the connection between information-theoretical private information retrieval and locally decodable codes,
one may derive some error-correcting data structures from the PIR results of [CIK+01]. However, the resulting
structures seem fairly weak.

Aumann and Bender present fault-tolerant stacks with $g(f) = O(f)$, and fault-tolerant linked
lists and binary search trees with $g(f) = O(f \log d)$, with only a constant-factor overhead in the
size of the data structure, and small computational overhead. Notice, however, that their
error-correcting demands are much weaker than ours: we require that no part of the data is lost (every
query should be answered with high success probability), even in the presence of a constant fraction
of errors. Of course, we pay for that in terms of the length of the data structure.



Faulty-memory RAM model. An alternative model of error-correcting data structures is the
"faulty-memory RAM model", introduced by Finocchi and Italiano [FI04]. In this model, one
assumes there are O(1) incorruptible memory cells available. This is justified by the fact that CPU
registers are much more robust than other kinds of memory. On the other hand, all other memory
cells can be faulty, including the ones used by the algorithm that is answering queries (something
our model does not consider). The model assumes an upper bound $\Delta$ on the number of errors.

Finocchi, Grandoni, and Italiano described essentially optimal resilient algorithms for sorting
that work in $O(n \log n + \Delta^2)$ time with $\Delta$ up to about $\sqrt{n}$, and for searching in $\Theta(\log n + \Delta)$
time. There is a lot of recent work in this model: Jørgensen et al. [JMM07] study resilient priority
queues, Finocchi et al. [FGI07] study resilient search trees, and Brodal et al. [BFF+07] study resilient
dictionaries. This interesting model allows for more efficient data structures than our model, but
its disadvantages are also clear: it assumes a small number of incorruptible cells, which may not
be available in many practical situations (for instance when the whole data structure is stored on
a hard disk), and the constructions mentioned above cannot deal well with a constant noise rate.



2 The Membership problem 

2.1 Noiseless case: the BMRV data structure for Membership 

Our error-correcting data structures for Membership rely heavily on the construction of Buhrman
et al. [BMRV02], whose relevant properties we sketch here. Their structure is obtained using the
probabilistic method. Explicit but slightly less efficient structures were subsequently given by
Ta-Shma [TS02]. The BMRV-structure maps $x \in \{0,1\}^n$ (of weight $\leq s$) to a string $y := y(x) \in \{0,1\}^{n'}$
of length $n' = \frac{100}{\varepsilon^2} s \log n$ that can be decoded with one probe if $\delta = 0$. More precisely, for every
$i \in [n]$ there is a set $P_i \subseteq [n']$ of size $|P_i| = \log(n)/\varepsilon$ such that for every x of weight $\leq s$:

$$\Pr[y_j = x_i] \geq 1 - \varepsilon, \qquad (1)$$

where the probability is taken over a uniform index $j \in P_i$. For fixed $\varepsilon$, the length $n' = O(s \log n)$
of the BMRV-structure is optimal up to a constant factor, because clearly $\log \binom{n}{s}$ is a lower bound.
2.2 Noisy case: 1 probe

For the noiseless case, the BMRV data structure has information-theoretically optimal length
$O(s \log n)$ and decodes with the minimal number of probes (one). This can also be achieved in
the error-correcting case if s = 1: then we just have the Equality problem, for which see the
remark following Definition 2. For larger s, one can observe that the BMRV-structure still works
with high probability if $\delta \ll 1/s$: in that case the total number of errors is $\delta n' \ll \log n$, so for each
i, most bits in the $O(\log n)$-set $P_i$ are uncorrupted.

Theorem 1 (BMRV) There exist $(1, \Omega(1/s), 1/4)$-error-correcting data structures for Membership
of length $N = O(s \log n)$.

This only works if $\delta \ll 1/s$, which is actually close to optimal, as follows. An s-bit LDC
can be embedded in an error-correcting data structure for Membership, hence it follows from
Katz-Trevisan's [KT00, Theorem 3] that there are no 1-probe error-correcting data structures for
Membership if $s > 1/(\delta(1 - H(\varepsilon)))$ (where $H(\cdot)$ denotes binary entropy). In sum, there are
1-probe error-correcting data structures for Membership of information-theoretically optimal length
if $\delta \ll 1/s$. In contrast, if $\delta \gg 1/s$ then there are no 1-probe error-correcting data structures at
all, not even of exponential length.

2.3 Noisy case: p > 1 probes 

As we argued in the introduction, for fixed $\varepsilon$ and $\delta$ there is an easy lower bound on the length N
of p-probe error-correcting data structures for s-out-of-n Membership:

$$N \geq \max\left(\mathrm{LDC}(p,s),\ \log B(n,s)\right).$$

Our nearly matching upper bound, described below, uses the $\varepsilon$-error data structure of [BMRV02]
for some small fixed $\varepsilon$. A simple way to obtain a p-probe error-correcting data structure is just to
encode their $O(s \log n)$-bit string y with the optimal p-probe LDC (with error $\varepsilon'$, say), which gives
length $\mathrm{LDC}(p, O(s \log n))$. The one probe to y is replaced by p probes to the LDC. By the union
bound, the error probability of the overall construction is at most $\varepsilon + \varepsilon'$. This, however, achieves
more than we need: this structure enables us to recover $y_j$ for every j, whereas it would suffice if
we were able to recover $y_j$ for most $j \in P_i$ (for each $i \in [n]$).

Definition of the data structure and decoder. To construct a shorter error-correcting data
structure, we proceed as follows. Let $\delta$ be a small constant (e.g. 1/10000); this is the noise level we
want our final data structure for Membership to protect against. Consider the BMRV-structure
for s-out-of-n Membership, with error probability at most 1/10. Then $n' = 10000\, s \log n$ is its
length, and $b = 10 \log n$ is the size of each of the sets $P_i$. Apply now a random permutation $\pi$ to
y (we show below that $\pi$ can be fixed to a specific permutation). View the resulting n'-bit string
as made up of $b = 10 \log n$ consecutive blocks of 1000s bits each. We encode each block with the
optimal $(p, 100\delta, 1/100)$-LDC that encodes 1000s bits. Let $\ell$ be the length of this LDC. This gives
overall length

$$N = 10\, \ell \log n.$$
The decoding procedure is as follows. Randomly choose a $k \in [b]$. This picks out one of the blocks.
If this kth block contains exactly one $j \in P_i$ then recover $y_j$ from the (possibly corrupted) LDC
for that block, using the p-probe LDC-decoder, and output $y_j$. If the kth block contains 0 or more
than 1 elements from $P_i$, then output a uniformly random bit.
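The decoding procedure can be sketched in a few lines. This is only an illustration under stated assumptions: `P[i]` is taken to hold the positions of $P_i$ after the permutation has been applied, and `ldc_decode` is a hypothetical callback standing in for the p-probe LDC-decoder of a block; neither name comes from the paper.

```python
import random

def decode_membership_bit(i, P, block_size, num_blocks, ldc_decode, rng=random):
    """One run of the block decoder for x_i.

    P[i]        positions of the set P_i inside the permuted string
    ldc_decode  hypothetical callback (block index, offset in block) -> bit,
                standing in for the p-probe LDC-decoder of that block
    """
    k = rng.randrange(num_blocks)                      # uniformly random block
    hits = [j for j in P[i] if j // block_size == k]   # elements of P_i in block k
    if len(hits) == 1:
        return ldc_decode(k, hits[0] % block_size)     # recover y_j from block k
    return rng.randrange(2)                            # 0 or >= 2 hits: random bit

# Toy noiseless run: two blocks of three bits, "LDC" replaced by direct lookup.
y = [1, 0, 0, 0, 1, 0]
P = {0: [0, 4]}                                        # one element of P_0 per block
lookup = lambda k, offset: y[k * 3 + offset]
print(decode_membership_bit(0, P, 3, 2, lookup))
```

In the toy run every block holds exactly one element of $P_0$ and both of those positions store a 1, so the decoder always returns 1; with a real LDC the callback would make p probes into the encoded block instead.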



Analysis. Our goal below is to show that we can fix the permutation $\pi$ such that for at least
n/20 of the indices $i \in [n]$, this procedure has good probability of correctly decoding $x_i$ (for all x of
weight $\leq s$). The intuition is as follows. Thanks to the random permutation and the fact that $|P_i|$
equals the number of blocks, the expected intersection between $P_i$ and a block is exactly 1. Hence
for many $i \in [n]$, many blocks will contain exactly one index $j \in P_i$. Moreover, for most blocks,
their LDC-encoding won't have too many errors, hence we can recover $y_j$ using the LDC-decoder
for that block. Since $y_j = x_i$ for 90% of the $j \in P_i$, we usually recover $x_i$.

To make this precise, call $k \in [b]$ "good for i" if block k contains exactly one $j \in P_i$, and let
$X_{ik}$ be the indicator random variable for this event. Call $i \in [n]$ "good" if at least b/4 of the blocks
are good for i (i.e., $\sum_{k \in [b]} X_{ik} \geq b/4$), and let $X_i$ be the indicator random variable for this event.
The expected value (over uniformly random $\pi$) of each $X_{ik}$ is the probability that if we randomly
place b balls into ab positions (a is the block-size 1000s), then there is exactly one ball among the
a positions of the first block, and the other b − 1 balls are in the last ab − a positions. This is

$$\frac{a \binom{ab-a}{b-1}}{\binom{ab}{b}} = \frac{(ab-b)(ab-b-1)\cdots(ab-b-a+2)}{(ab-1)(ab-2)\cdots(ab-a+1)} \geq \left(\frac{ab-b-a+2}{ab-a+1}\right)^{a-1} \geq \left(1 - \frac{1}{a-1}\right)^{a-1}.$$

The righthand side goes to $1/e \approx 0.37$ with large a, so we can safely lower bound it by 3/10. Then,
using linearity of expectation:



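$$\frac{3bn}{10} \leq \mathrm{Exp}\Big[\sum_{i \in [n], k \in [b]} X_{ik}\Big] \leq b \cdot \mathrm{Exp}\Big[\sum_{i \in [n]} X_i\Big] + \frac{b}{4}\Big(n - \mathrm{Exp}\Big[\sum_{i \in [n]} X_i\Big]\Big),$$

which implies

$$\mathrm{Exp}\Big[\sum_{i \in [n]} X_i\Big] \geq \frac{n}{20}.$$

Hence we can fix one permutation $\pi$ such that at least n/20 of the indices i are good.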
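The two numerical claims in this counting argument can be checked with exact arithmetic. The sketch below (function names ours) computes the probability that a fixed block receives exactly one of the b balls, both as a binomial ratio and as the product of a − 1 ratios, and then solves the linearity-of-expectation inequality for the expected fraction of good indices. The sample values a = 1000 and b = 40 are illustrative, not the paper's.

```python
from fractions import Fraction
from math import comb

def prob_exactly_one(a, b):
    """Pr[a fixed block of a positions gets exactly one of b balls in a*b positions]."""
    return Fraction(a * comb(a * b - a, b - 1), comb(a * b, b))

def product_form(a, b):
    """The same probability written as a product of a - 1 ratios."""
    num = den = 1
    for t in range(a - 1):
        num *= a * b - b - t
        den *= a * b - 1 - t
    return Fraction(num, den)

def good_fraction_lower_bound(p_good_block=Fraction(3, 10), threshold=Fraction(1, 4)):
    """Solve p*b*n <= b*E + threshold*b*(n - E) for E/n."""
    return (p_good_block - threshold) / (1 - threshold)

a, b = 1000, 40
print(float(prob_exactly_one(a, b)))         # comfortably above 3/10
print(good_fraction_lower_bound())           # 1/15, which is at least 1/20
```

Solving the inequality exactly gives an expected fraction of at least 1/15 of good indices, slightly better than the n/20 used in the text.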

For every index i, at least 90% of all $j \in P_i$ satisfy $y_j = x_i$. Hence for a good index i, with
probability at least 1/4 − 1/10 we will pick a k such that the kth block is good for i and the unique
$j \in P_i$ in the kth block satisfies $y_j = x_i$. By Markov's inequality, the probability that the block
that we picked has more than a $100\delta$-fraction of errors is less than 1/100. If the fraction of errors
is at most $100\delta$, then our LDC-decoder recovers the relevant bit $y_j$ with probability 99/100. Hence
the overall probability of outputting the correct value $x_i$ is at least

$$\frac{3}{4} \cdot \frac{1}{2} + \left(\frac{1}{4} - \frac{1}{10} - \frac{1}{100}\right) \cdot \frac{99}{100} > \frac{51}{100}.$$
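The final bound is plain arithmetic, and writing it with named parameters makes it easier to see how each constant enters. The grouping of terms follows the bound above; the parameter names are ours.

```python
from fractions import Fraction as F

def success_lower_bound(good_blocks=F(1, 4), bmrv_err=F(1, 10),
                        noisy_block=F(1, 100), ldc_ok=F(99, 100)):
    """Lower bound on Pr[decoder outputs x_i] for a good index i."""
    # Blocks that are not good for i: the decoder guesses, right half the time.
    guess = (1 - good_blocks) * F(1, 2)
    # Good block, y_j = x_i, block not too noisy, and the LDC-decoder succeeds.
    informed = (good_blocks - bmrv_err - noisy_block) * ldc_ok
    return guess + informed

bound = success_lower_bound()
print(bound, float(bound))
```

The margin above 1/2 is small (about 0.014), which is why the resulting error parameter in the theorem is as weak as 49/100.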



We end up with an error-correcting data structure for Membership for a universe of size n/20 
instead of n elements, but we can fix this by starting with the BMRV-structure for 20n bits. 
We summarize this construction in a theorem: 
Theorem 2 If there exists a $(p, 100\delta, 1/100)$-LDC of length $\ell$ that encodes 1000s bits, then there
exists a $(p, \delta, 49/100)$-error-correcting data structure of length $O(\ell \log n)$ for the s-out-of-n
Membership problem.

The error and noise parameters of this new structure are not great, but they can be improved
by more careful analysis. We here sketch a better solution without giving all technical details.
Suppose we change the decoding procedure for $x_i$ as follows: pick $j \in P_i$ uniformly at random,
decode $y_j$ from the LDC of the block where $y_j$ sits, and output the result. There are three sources
of error here: (1) the BMRV-structure makes a mistake (i.e., j happens to be such that $y_j \neq x_i$),
(2) the LDC-decoder fails because there is too much noise on the LDC that we are decoding from,
(3) the LDC-decoder fails even though there is not too much noise on it. The 2nd kind is hardest
to analyze. The adversary will do best if he puts just a bit more than the tolerable noise-level on
the encodings of blocks that contain the most $j \in P_i$, thereby "destroying" those encodings.

For a random permutation, we expect that about $b/(e \cdot m!)$ of the b blocks contain m elements
of $P_i$. Hence about 1/65 of all blocks have 4 or more elements of $P_i$. If the LDC is designed to
protect against a $65\delta$-fraction of errors within one encoded block, then with overall error-fraction
$\delta$, the adversary has exactly enough noise to "destroy" all blocks containing 4 or more elements of
$P_i$. The probability that our uniformly random j sits in such a "destroyed" block is about

$$\sum_{m \geq 4} \frac{m}{b} \cdot \frac{b}{e \cdot m!} = \frac{1}{e}\left(\frac{1}{3!} + \frac{1}{4!} + \cdots\right) \approx 0.08.$$

Hence if we set the error of the BMRV-structure to 1/10 and the error of the LDC to 1/100 (as
above), then the total error probability for decoding $x_i$ is less than 0.2 (of course we need to show
that we can fix a $\pi$ such that good decoding occurs for a good fraction of all $i \in [n]$). Another
parameter that may be adjusted is the block size, which we here took to be 1000s. Clearly, different
tradeoffs between codelength, tolerable noise-level, and error probability are possible.
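The error budget of this sketched decoder can be checked numerically. The code below takes the approximation from the text at face value (a fraction of about $1/(e \cdot m!)$ of the blocks holds m elements of $P_i$) and sums the chance that a uniformly random $j \in P_i$ lands in a block with at least 4 such elements. The function names and the truncation after 60 terms are ours.

```python
from math import e, factorial

def frac_blocks_with(m):
    """Approximate fraction of blocks holding exactly m elements of P_i."""
    return 1 / (e * factorial(m))

def prob_j_in_destroyed_block(threshold=4, terms=60):
    """Chance that a random j in P_i sits in a block with >= threshold elements.

    A block with m elements is hit with probability proportional to m,
    so each term is m * (fraction of such blocks).
    """
    return sum(m * frac_blocks_with(m) for m in range(threshold, threshold + terms))

bmrv_error, ldc_error = 0.1, 0.01
destroyed = prob_j_in_destroyed_block()
print(destroyed)                              # roughly 0.08
print(bmrv_error + destroyed + ldc_error)     # total, below 0.2
```

The sum has the closed form $1 - 2.5/e$, so the three error sources add up to roughly 0.19 under these assumptions.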

3 The Inner product problem 

3.1 Noiseless case 

Here we show bounds for Inner product, first for the case where there is no noise ($\delta = 0$).

Upper bound. Consider all strings z of weight at most $\lceil r/p \rceil$. The number of such z is
$B(n, \lceil r/p \rceil) = \sum_{i=0}^{\lceil r/p \rceil} \binom{n}{i} \leq (epn/r)^{r/p}$. We define our codeword by writing down, for all z in
lexicographic order, the inner product $x \cdot z \bmod 2$. If we want to recover the inner product $x \cdot y$ for
some y of weight at most r, we write $y = z_1 + \cdots + z_p$ for $z_j$'s of weight at most $\lceil r/p \rceil$ and recover
$x \cdot z_j$ for each $j \in [p]$, using one probe for each. Summing the results of the p probes gives $x \cdot y$
(mod 2). In particular, for p = 1 probes, the length is B(n,r).
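A minimal sketch of this noiseless p-probe structure follows. For readability it stores the codeword as a dictionary keyed by the support of z instead of as a bit string in lexicographic order; that representation, and the function names, are our choices and do not change the number of stored bits or probes.

```python
from itertools import combinations
from math import ceil

def encode_ip(x, r, p):
    """Store x.z mod 2 for every z of weight <= ceil(r/p), keyed by z's support."""
    w = ceil(r / p)
    table = {}
    for k in range(w + 1):
        for support in combinations(range(len(x)), k):
            table[support] = sum(x[i] for i in support) % 2
    return table

def query_ip(table, y, r, p):
    """Answer x.y mod 2 with p table look-ups (one probe per piece z_j)."""
    w = ceil(r / p)
    support = [i for i, bit in enumerate(y) if bit]
    assert len(support) <= r, "query weight exceeds r"
    answer = 0
    for j in range(p):
        piece = tuple(support[j * w:(j + 1) * w])   # support of z_j, may be empty
        answer ^= table[piece]                      # one probe
    return answer

x = [1, 0, 1, 1, 0, 1]
y = [1, 1, 0, 1, 0, 1]
table = encode_ip(x, r=4, p=2)
print(len(table), query_ip(table, y, r=4, p=2))     # B(6, 2) = 22 entries; x.y mod 2 = 1
```

Splitting the support of y into p consecutive pieces is one valid way to write $y = z_1 + \cdots + z_p$; any split into pieces of weight at most $\lceil r/p \rceil$ would do.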

Lower bound. To prove a nearly-matching lower bound, we use Miltersen's technique of relating
a data structure to a two-party communication game [Mil94]. We refer to [KN97] for a general
introduction to communication complexity. Suppose Alice gets string $x \in \{0,1\}^n$, Bob gets string
$y \in \{0,1\}^n$ of weight $\leq r$, and they need to compute $x \cdot y \pmod 2$ with bounded error probability
and minimal communication between them. Call this communication problem $\mathrm{IP}_{n,r}$. Let
$B(n,r) = \sum_{i=0}^{r} \binom{n}{i}$ be the size of Q, i.e., the number of possible queries y. The proof of our communication
complexity lower bound below uses a fairly standard discrepancy argument, but we have not found
this specific result anywhere. For completeness we include a proof in Appendix A.

Theorem 3 Every communication protocol for $\mathrm{IP}_{n,r}$ with worst-case (or even average-case) success
probability $\geq 1/2 + \beta$ needs at least $\log(B(n,r)) - 2\log(1/2\beta)$ bits of communication.

Armed with this communication complexity bound we can lower bound data structure length:

Theorem 4 Every $(p, \varepsilon)$-data structure for $\mathrm{IP}_{n,r}$ needs space $N \geq \frac{1}{2} \cdot 2^{(\log(B(n,r)) - 2\log(1/(1-2\varepsilon)) - 1)/p}$.

Proof. We will use the data structure to obtain a communication protocol for $\mathrm{IP}_{n,r}$ that uses
$p(\log(N) + 1) + 1$ bits of communication, and then invoke Theorem 3 to obtain the lower bound.

Alice holds x, and hence $\phi(x)$, while Bob simulates the decoder. Bob starts the communication.
He picks his first probe to the data structure and sends it over in log N bits. Alice sends back
the 1-bit answer. After p rounds of communication, all p probes have been simulated and Bob
can give the same output as the decoder would have given. Bob's output will be the last bit of
the communication. Theorem 3 now implies $p(\log(N) + 1) + 1 \geq \log(B(n,r)) - 2\log(1/(1 - 2\varepsilon))$.
Rearranging gives the bound on N. □

For fixed $\varepsilon$, the lower bound is $N = \Omega(B(n,r)^{1/p})$. This is $\Omega((n/r)^{r/p})$, which (at least for small
$p$) is not too far from the upper bound of approximately $(epn/r)^{r/p}$ mentioned above. Note that in
general our bound on $N$ is superpolynomial in $n$ whenever $p = o(r)$. For instance, when $r = \alpha n$ for
some constant $\alpha \in (0, 1/2)$ then $N = \Omega(2^{nH(\alpha)/p})$, which is non-trivial whenever $p = o(n)$. Finally,
note that the proof technique also works if Alice's messages are longer than 1 bit (i.e., if the code
is over a larger-than-binary alphabet).
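To get a feel for these bounds, the following minimal Python sketch evaluates $B(n,r)$ and the bounds of Theorems 3 and 4 numerically. The function names and the sample parameters are illustrative assumptions, not part of the paper.

```python
from math import comb, log2

def num_queries(n: int, r: int) -> int:
    """B(n, r): number of n-bit strings of Hamming weight at most r."""
    return sum(comb(n, i) for i in range(r + 1))

def comm_lower_bound(n: int, r: int, beta: float) -> float:
    """Theorem 3: bits needed for success probability >= 1/2 + beta."""
    return log2(num_queries(n, r)) - 2 * log2(1 / (2 * beta))

def length_lower_bound(n: int, r: int, p: int, eps: float) -> float:
    """Theorem 4: lower bound on N for p probes and error probability eps."""
    c = comm_lower_bound(n, r, 0.5 - eps)  # 2 * beta = 1 - 2 * eps
    return 0.5 * 2 ** ((c - 1) / p)

if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    n, r, eps = 64, 8, 0.25
    for p in (1, 2, 4, 8):
        print(p, round(length_lower_bound(n, r, p, eps)))
```

The bound shrinks roughly like the $p$-th root of $B(n,r)$ as the number of probes grows, which matches the $\Omega(B(n,r)^{1/p})$ behaviour stated above.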

3.2 Noisy case 

3.2.1 Constructions for Substring 

One can easily construct error-correcting data structures for Substring, which also suffice for
Inner product. Note that since we are recovering $r$ bits, and each probe gives at most one bit
of information, by information theory we need at least about $r$ probes to the data structure.⁶ Our
solutions below will use $O(r \log r)$ probes. View $x$ as a concatenation $x = x^{(1)} \ldots x^{(r)}$ of $r$ strings
of $n/r$ bits each (we ignore rounding for simplicity), and define $\phi(x)$ as the concatenation of the
Hadamard codes of these $r$ pieces. Then $\phi(x)$ has length $N = r \cdot 2^{n/r}$.

If $\delta \ge 1/4r$ then the adversary could corrupt one of the $r$ Hadamard codes by 25% noise, ensuring
that some of the bits of $x$ are irrevocably lost even when we allow the full $N$ probes. However, if
$\delta \ll 1/r$ then we can recover each bit $x_i$ with small constant error probability by 2 probes in the
Hadamard codeword where $i$ sits, and with error probability $\ll 1/r$ using $O(\log r)$ probes. Hence
we can compute $f(x,y) = x_y$ with error close to 0 using $p = O(r \log r)$ probes (or with $2r$ probes if
$\delta \ll 1/r^2$).⁷ This also implies that any data structure problem where $f(x,q)$ depends on at most
some fixed constant $r$ bits of $x$, has an error-correcting data structure of length $N = r \cdot 2^{n/r}$, $p =
O(r \log r)$, and that works if $\delta \ll 1/r$. Alternatively, we can take Efremenko's [Efr08] or Yekhanin's
3-probe LDC [Yek07], and just decode each of the $r$ bits separately. Using $O(\log r)$ probes to recover
a bit with error probability $\ll 1/r$, we recover the $r$-bit string $x_y$ using $p = O(r \log r)$ probes even
if $\delta$ is a constant independent of $r$.

6 $d/(\log(N) + 1)$ probes in the case of quantum decoders.

7 It follows from Buhrman et al. [BNRW07] that if we allow a quantum decoder, the factor of $\log r$ is not needed.
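The blockwise Hadamard construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: pieces of $x$ are packed into integers, the names `encode` and `decode_bit` are hypothetical, and the noise is one flipped position per block rather than an adversarial pattern.

```python
import random

def dot(a: int, b: int) -> int:
    """Inner product mod 2 of two bit vectors packed into ints."""
    return bin(a & b).count("1") % 2

def encode(pieces, k):
    """One Hadamard codeword (length 2**k) for each k-bit piece of x."""
    return [[dot(piece, z) for z in range(2 ** k)] for piece in pieces]

def decode_bit(code, block, pos, reps, rng):
    """Recover bit `pos` of piece `block` by majority over `reps` 2-probe tests."""
    size = len(code[block])
    votes = 0
    for _ in range(reps):
        z = rng.randrange(size)
        # Probes z and z XOR e_pos; on a clean codeword their XOR is the wanted bit.
        votes += code[block][z] ^ code[block][z ^ (1 << pos)]
    return int(2 * votes > reps)

k = 6
pieces = [0b101101, 0b010011, 0b111000]   # x split into r = 3 pieces of 6 bits
code = encode(pieces, k)
rng = random.Random(7)
for block in code:                         # flip one position in every block
    block[rng.randrange(2 ** k)] ^= 1
query = [(0, 0), (1, 4), (2, 5)]           # (block, position) pairs to read
answer = [decode_bit(code, b, i, 15, rng) for b, i in query]
print(answer)
```

Each 2-probe test fails only if one of its two probes hits a corrupted position, so repeating it $O(\log r)$ times and taking the majority drives the per-bit error below $1/r$, as described above.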

3.2.2 Constructions for Inner product 

Going through the proof of [Yek07], it is easy to see that it allows us to compute the parity of any
set of $r$ bits from $x$ using at most $3r$ probes with error $\varepsilon$, if the noise rate $\delta$ is at most $\varepsilon/(3r)$ (just
add the results of the 3 probes one would make for each bit in the parity). To get error-correcting
data structures even for small constant $p$ (independent of $r$), we can adapt the polynomial schemes
from [BIK05] to get the following theorem. The details are given in Appendix B.

Theorem 5 For every $p \ge 2$, there exists a $(p, \delta, p\delta)$-error-correcting data structure for $\mathrm{IP}_{n,r}$ of
length $N \le p \cdot 2^{r(p-1)^2 n^{1/(p-1)}}$.

For the $p = 2$ case, we get something simpler and better from the Hadamard code. This code,
of length $2^n$, actually allows us to compute $x \cdot y \pmod 2$ for any $y \in \{0,1\}^n$ of our choice, with 2
probes and error probability at most $2\delta$ (just probe $z$ and $y \oplus z$ for uniformly random $z \in \{0,1\}^n$
and observe that $(x \cdot z) \oplus (x \cdot (y \oplus z)) = x \cdot y$). Note that for $r = \Omega(n)$ and $p = O(1)$, even
non-error-correcting data structures need length $2^{\Theta(n)}$ (Theorem 4). This is an example where
error-correcting data structures are not significantly longer than the non-error-correcting kind.
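The 2-probe Hadamard decoder is easy to check exhaustively for small $n$. The sketch below is an illustration only; the helper names and the randomly chosen corruption pattern are assumptions. It counts, over all choices of $z$, how often the answer is wrong and compares that fraction with $2\delta$.

```python
import random

def dot(a: int, b: int) -> int:
    """Inner product mod 2 of two bit vectors packed into ints."""
    return bin(a & b).count("1") % 2

def hadamard(x: int, n: int):
    """Hadamard codeword of x: the bit x.z for every z in {0,1}^n."""
    return [dot(x, z) for z in range(2 ** n)]

def query(code, y: int, z: int) -> int:
    """Two probes, at z and y XOR z; their XOR equals x.y if both are clean."""
    return code[z] ^ code[z ^ y]

def failure_fraction(code, x: int, y: int, n: int) -> float:
    """Exact fraction of choices of z for which the 2-probe answer is wrong."""
    wrong = sum(query(code, y, z) != dot(x, y) for z in range(2 ** n))
    return wrong / 2 ** n

n, x, y = 8, 0b10110101, 0b00011110
rng = random.Random(1)
code = hadamard(x, n)
corrupted = rng.sample(range(2 ** n), 12)   # noise rate delta = 12/256
for pos in corrupted:
    code[pos] ^= 1
delta = len(corrupted) / 2 ** n
print(failure_fraction(code, x, y, n), "<=", 2 * delta)
```

A choice of $z$ can only fail if $z$ or $y \oplus z$ is a corrupted position, which is why the failure fraction never exceeds $2\delta$.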

4 Future work 

Many questions are opened up by our model of error-correcting data structures. We mention a few: 

• There are plenty of other natural data structure problems, such as Rank, Predecessor,
versions of Nearest neighbor etc. [Mil99]. What about the length-vs-probes tradeoffs for
their error-correcting versions? The obvious approach is to put the best known LDC on top of
the best known non-error-correcting data structures. This is not always optimal, though; for
instance in the case of s-out-of-n Membership one can do significantly better, as we showed.

• It is often natural to assume that a memory cell contains not a bit, but some number from,
say, a polynomial-size universe. This is called the cell-probe model [Yao81], in contrast to
the bit-probe model we considered here. Probing a cell gives $O(\log n)$ bits at the same time,
which can significantly improve the length-vs-probes tradeoff and is worth studying. Still,
we view the bit-probe approach taken here as more fundamental than the cell-probe model.
A $p$-probe cell-probe structure is an $O(p \log n)$-probe bit-probe structure, but not vice versa.
Also, the way memory is addressed in actual computers, in constant chunks of, say, 8 or 16
bits at a time, is closer in spirit to the bit-probe model than to the cell-probe model.

• Zvi Lotker suggested to me the following connection with distributed computing. Suppose 
the data structure is distributed over N processors, each holding one bit. Interpreted in this 
setting, an error-correcting data structure allows an honest party to answer queries about 
the encoded object while communicating with at most p processors. The answer will be 
correct with probability $1 - \varepsilon$, even if up to a $\delta$-fraction of the $N$ processors are faulty or even
malicious (the querier need not know where the faulty/malicious sites are).

A Proof of Theorem 3

Let $\mu$ be the uniform input distribution: each $x$ has probability $1/2^n$ and each $y$ of weight $\le r$ has
probability $1/B(n,r)$. We show a lower bound on the communication $c$ of deterministic protocols
that compute $\mathrm{IP}_{n,r}$ with $\mu$-probability at least $1/2 + \beta$. By Yao's principle [Yao77], this lower bound
then also applies to randomized protocols.

Consider a deterministic $c$-bit protocol. Assume the last bit communicated is the output bit. It
is well-known that this partitions the input space into rectangles $R_1, \ldots, R_{2^c}$, where $R_i = A_i \times B_i$,
and the protocol gives the same output bit $a_i$ for each $(x,y) \in R_i$.⁸ The discrepancy of rectangle
$R = A \times B$ under $\mu$ is the difference between the weight of the 0s and the 1s in that rectangle:

$$\delta_\mu(R) = \left| \mu\big(R \cap \mathrm{IP}_{n,r}^{-1}(1)\big) - \mu\big(R \cap \mathrm{IP}_{n,r}^{-1}(0)\big) \right|.$$

We can show for every rectangle that its discrepancy is not very large: 

Lemma 1 $\delta_\mu(R) \le \dfrac{\sqrt{|R|}}{\sqrt{2^n}\, B(n,r)}$.

8 [KN97, Section 1.2]. The number of rectangles may be smaller than $2^c$, but we can always add empty ones.

Proof. Let $M$ be the $2^n \times B(n,r)$ matrix whose $(x,y)$-entry is $(-1)^{\mathrm{IP}_{n,r}(x,y)} = (-1)^{x \cdot y}$. It is easy
to see that $M^T M = 2^n I$, where $I$ is the $B(n,r) \times B(n,r)$ identity matrix. This implies, for any
vector $v \in \mathbb{R}^{B(n,r)}$:

$$\| Mv \|^2 = (Mv)^T \cdot (Mv) = v^T M^T M v = 2^n v^T v = 2^n \| v \|^2.$$

Let $R = A \times B$, and let $v_A \in \{0,1\}^{2^n}$ and $v_B \in \{0,1\}^{B(n,r)}$ be the characteristic (column) vectors of
the sets $A$ and $B$. Note that $\| v_A \| = \sqrt{|A|}$ and $\| v_B \| = \sqrt{|B|}$. The sum of $M$-entries in $R$ is
$\sum_{a \in A, b \in B} M_{ab} = v_A^T M v_B$. We can bound this using Cauchy-Schwarz:

$$|v_A^T M v_B| \le \| v_A \| \cdot \| M v_B \| = \| v_A \| \cdot \sqrt{2^n}\, \| v_B \| = \sqrt{|A| \cdot |B| \cdot 2^n}.$$

Observing that $\delta_\mu(R) = |v_A^T M v_B| / (2^n B(n,r))$ and $|R| = |A| \cdot |B|$ concludes the proof. □
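Both facts used in this proof can be checked by brute force for small parameters. The following Python sketch is illustrative only, with hypothetical helper names and arbitrarily chosen $n = 4$, $r = 2$. It verifies that the columns of $M$ are orthogonal with squared norm $2^n$, and compares the discrepancy of random rectangles with the bound of Lemma 1.

```python
import random
from math import sqrt

def dot(a: int, b: int) -> int:
    """Inner product mod 2 of two bit vectors packed into ints."""
    return bin(a & b).count("1") % 2

def low_weight(n: int, r: int):
    """All y in {0,1}^n of Hamming weight at most r (there are B(n, r) of them)."""
    return [y for y in range(2 ** n) if bin(y).count("1") <= r]

def columns_orthogonal(n: int, ys) -> bool:
    """Check M^T M = 2^n I for the matrix M[x][y] = (-1)^(x.y)."""
    for y1 in ys:
        for y2 in ys:
            s = sum((-1) ** (dot(x, y1) ^ dot(x, y2)) for x in range(2 ** n))
            if s != (2 ** n if y1 == y2 else 0):
                return False
    return True

def discrepancy(A, B, n: int, num_y: int) -> float:
    """|mu(ones in A x B) - mu(zeros in A x B)| under the uniform distribution."""
    total = sum((-1) ** dot(a, b) for a in A for b in B)
    return abs(total) / (2 ** n * num_y)

def lemma_bound(A, B, n: int, num_y: int) -> float:
    """Right-hand side of Lemma 1: sqrt(|R|) / (sqrt(2^n) * B(n, r))."""
    return sqrt(len(A) * len(B)) / (sqrt(2 ** n) * num_y)

n, r = 4, 2
ys = low_weight(n, r)
rng = random.Random(3)
A = rng.sample(range(2 ** n), 5)
B = rng.sample(ys, 4)
print(columns_orthogonal(n, ys), discrepancy(A, B, n, len(ys)), lemma_bound(A, B, n, len(ys)))
```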

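Define the success and failure probabilities (under $\mu$) of the protocol as

$$P_s = \sum_{i=1}^{2^c} \mu\big(R_i \cap \mathrm{IP}_{n,r}^{-1}(a_i)\big) \quad \text{and} \quad P_f = \sum_{i=1}^{2^c} \mu\big(R_i \cap \mathrm{IP}_{n,r}^{-1}(1 - a_i)\big),$$

where $a_i$ is the protocol's output bit on rectangle $R_i$. Then

$$\begin{aligned}
2\beta &\le P_s - P_f \\
&= \sum_i \Big( \mu\big(R_i \cap \mathrm{IP}_{n,r}^{-1}(a_i)\big) - \mu\big(R_i \cap \mathrm{IP}_{n,r}^{-1}(1 - a_i)\big) \Big) \\
&\le \sum_i \delta_\mu(R_i) \le \sum_i \frac{\sqrt{|R_i|}}{\sqrt{2^n}\, B(n,r)} \\
&\le \frac{\sqrt{2^c} \sqrt{\sum_i |R_i|}}{\sqrt{2^n}\, B(n,r)} = \sqrt{\frac{2^c}{B(n,r)}},
\end{aligned}$$

where the last inequality is Cauchy-Schwarz and the last equality holds because $\sum_i |R_i|$ is the total
number of inputs, which is $2^n B(n,r)$.

Rearranging gives $2^c \ge (2\beta)^2 B(n,r)$, hence $c \ge \log(B(n,r)) - 2\log(1/(2\beta))$.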

B Proof of Theorem 5

Here we construct $p$-probe error-correcting data structures for the inner product problem, inspired
by the approach to locally decodable codes of [BIK05]. Let $d$ be an integer to be determined later.
Pick $m = \lceil d n^{1/d} \rceil$. Then $\binom{m}{d} \ge n$, so there exist $n$ distinct sets $S_1, \ldots, S_n \subseteq [m]$, each of size $d$.
For each $x \in \{0,1\}^n$, define an $m$-variate polynomial $p_x$ of degree $d$ over $\mathbb{F}_2$ by

$$p_x(z_1, \ldots, z_m) = \sum_{i=1}^{n} x_i \prod_{j \in S_i} z_j.$$

Note that if we identify $S_i$ with its $m$-bit characteristic vector, then $p_x(S_i) = x_i$. For $z^{(1)}, \ldots, z^{(r)} \in
\{0,1\}^m$, define an $rm$-variate polynomial $p_{x,r}$ over $\mathbb{F}_2$ by

$$p_{x,r}(z^{(1)}, \ldots, z^{(r)}) = \sum_{j=1}^{r} p_x(z^{(j)}).$$

This polynomial $p_{x,r}(z)$ has $rm$ variables, degree $d$, and allows us to evaluate parities of any set of
$r$ of the variables of $x$: if $y \in \{0,1\}^n$ (of weight $r$) has its 1-bits at positions $i_1, \ldots, i_r$, then

$$p_{x,r}(S_{i_1}, \ldots, S_{i_r}) = \sum_{j=1}^{r} x_{i_j} = x \cdot y \pmod 2.$$
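The two identities $p_x(S_i) = x_i$ and $p_{x,r}(S_{i_1}, \ldots, S_{i_r}) = x \cdot y$ can be checked directly for small parameters. The Python sketch below is an illustration, not the paper's implementation: the helper names are hypothetical, the sets $S_i$ are simply the first $n$ size-$d$ subsets in lexicographic order, and $m$ is nudged upward if floating-point rounding makes $\binom{m}{d}$ fall short of $n$.

```python
from itertools import combinations, islice
from math import ceil, comb

def index_sets(n: int, d: int):
    """Return m and n distinct d-subsets of range(m), with m about d * n**(1/d)."""
    m = ceil(d * n ** (1 / d))
    while comb(m, d) < n:          # guard against floating-point rounding
        m += 1
    return m, list(islice(combinations(range(m), d), n))

def char_vec(s, m: int):
    """m-bit characteristic vector of the set s."""
    return [1 if j in s else 0 for j in range(m)]

def p_x(x, sets, z) -> int:
    """p_x(z) = sum_i x_i * prod_{j in S_i} z_j over F_2 (x and z are 0/1 lists)."""
    return sum(x[i] for i, s in enumerate(sets) if all(z[j] for j in s)) % 2

def p_xr(x, sets, zs) -> int:
    """p_{x,r}(z^(1), ..., z^(r)) = sum_j p_x(z^(j)) over F_2."""
    return sum(p_x(x, sets, z) for z in zs) % 2

x = [1, 0, 1, 1, 0, 1]
m, sets = index_sets(len(x), 2)
positions = (0, 2, 4)              # 1-bits of a query y of weight r = 3
value = p_xr(x, sets, [char_vec(sets[i], m) for i in positions])
print(m, value, sum(x[i] for i in positions) % 2)
```

Since all sets have the same size $d$, the only set contained in $S_i$ is $S_i$ itself, which is why evaluating $p_x$ at the characteristic vector of $S_i$ isolates $x_i$.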

To construct an error-correcting data structure for $\mathrm{IP}_{n,r}$, it thus suffices to give a structure that
enables us to evaluate $p_{x,r}$ at any point $w$ of our choice.⁹

Let $w \in \{0,1\}^{rm}$. Suppose we "secret-share" this into $p$ pieces $w^{(1)}, \ldots, w^{(p)} \in \{0,1\}^{rm}$ which
are uniformly random subject to the constraint $w = w^{(1)} + \cdots + w^{(p)}$. Now consider the $prm$-variate
polynomial $q_{x,r}$ defined by

$$q_{x,r}(w^{(1)}, \ldots, w^{(p)}) = p_{x,r}(w^{(1)} + \cdots + w^{(p)}). \qquad (2)$$

Each monomial $M$ in this polynomial has at most $d$ variables. If we pick $d = p - 1$, then for
every $M$ there will be a $j \in [p]$ such that $M$ does not contain variables from $w^{(j)}$. Assign all such
monomials to a new polynomial $q^{(j)}_{x,r}$, which is independent of $w^{(j)}$. This allows us to write

$$q_{x,r}(w^{(1)}, \ldots, w^{(p)}) = q^{(1)}_{x,r}(w^{(2)}, \ldots, w^{(p)}) + \cdots + q^{(p)}_{x,r}(w^{(1)}, \ldots, w^{(p-1)}). \qquad (3)$$

Note that each $q^{(j)}_{x,r}$ has domain of size $2^{(p-1)rm}$. The data structure is defined as the concatenation,
for all $j \in [p]$, of the values of $q^{(j)}_{x,r}$ on all possible inputs. This has length

$$N = p \cdot 2^{(p-1)rm} = p \cdot 2^{r(p-1)^2 n^{1/(p-1)}}.$$

This length is $2^{O(r n^{1/(p-1)})}$ for $p = O(1)$.

Answering a query works as follows: the decoder would like to evaluate $p_{x,r}$ on some point
$w \in \{0,1\}^{rm}$. He picks $w^{(1)}, \ldots, w^{(p)}$ as above, and for all $j \in [p]$, in the $j$th block of the code
probes the point $w^{(1)}, \ldots, w^{(j-1)}, w^{(j+1)}, \ldots, w^{(p)}$. This, if uncorrupted, returns the value of $q^{(j)}_{x,r}$ at
that point. The decoder outputs the sum of his $p$ probes (mod 2). If none of the probed bits were
corrupted, then the output is $p_{x,r}(w)$ by Eqs. (2) and (3). Note that the probe within the $j$th block
is uniformly random in that block, so its error probability is exactly the fraction $\delta_j$ of errors in the
$j$th block. Hence by the union bound, the total error probability is at most $\sum_{j=1}^{p} \delta_j$. If the overall
fraction of errors in the data structure is at most $\delta$, then we have $\frac{1}{p} \sum_{j=1}^{p} \delta_j \le \delta$, hence the total
error probability is at most $p\delta$.
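The decomposition in Eq. (3) and the noise-free decoder can be sketched in Python as follows. This is a minimal illustration under several assumptions: the tables of the $q^{(j)}_{x,r}$ are not materialized (each probed entry is computed on demand), every term of the expansion is assigned to the smallest share index it does not use, and all function names are hypothetical.

```python
import random
from itertools import combinations, islice, product
from math import ceil, comb

def monomials(x, r: int, p: int):
    """Monomials of p_{x,r} as tuples of variable indices in range(r*m), d = p - 1."""
    n, d = len(x), p - 1
    m = ceil(d * n ** (1 / d))
    while comb(m, d) < n:          # guard against floating-point rounding
        m += 1
    sets = list(islice(combinations(range(m), d), n))
    monos = [tuple(b * m + j for j in sets[i])
             for b in range(r) for i in range(n) if x[i]]
    return m, sets, monos

def query_point(sets, m: int, positions):
    """Concatenated characteristic vectors (S_{i_1}, ..., S_{i_r})."""
    return [int(j in sets[i]) for i in positions for j in range(m)]

def p_xr(monos, w) -> int:
    """Direct evaluation of p_{x,r}(w) over F_2."""
    return sum(all(w[v] for v in mono) for mono in monos) % 2

def q_part(j: int, shares, monos, p: int) -> int:
    """Entry of block j: the expanded terms whose smallest unused share is j."""
    total = 0
    for mono in monos:
        for pick in product(range(p), repeat=len(mono)):
            if min(s for s in range(p) if s not in pick) == j:
                total += all(shares[s][v] for s, v in zip(pick, mono))
    return total % 2

def decode(w, monos, p: int, rng) -> int:
    """Secret-share w into p pieces and add up one probe per block (mod 2)."""
    shares = [[rng.randrange(2) for _ in w] for _ in range(p - 1)]
    shares.append([bit ^ (sum(col) % 2) for bit, col in zip(w, zip(*shares))])
    # Block j is probed at a point that omits share j (replaced by None here).
    return sum(q_part(j, shares[:j] + [None] + shares[j + 1:], monos, p)
               for j in range(p)) % 2

x, r, p = [1, 0, 1, 1, 0], 2, 3
m, sets, monos = monomials(x, r, p)
w = query_point(sets, m, (0, 3))
print(decode(w, monos, p, random.Random(0)), (x[0] + x[3]) % 2)
```

Because each term has at most $d = p - 1$ variables, it uses at most $p - 1$ of the shares, so some share index is always unused and the assignment in `q_part` is well defined.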



9 If we also want to be able to compute $x \cdot y \pmod 2$ for $|y| < r$, we can just add a dummy 0 as $(n+1)$st variable
to $x$, and use its index $r - |y|$ times as inputs to $p_{x,r}$.